The plotly package. A godsend for interactive documents, dashboards and presentations. For such documents there is no doubt that anyone would prefer a plot created in plotly rather than ggplot2. Why? Using plotly gives you neat and, crucially, interactive options at the top of the plot, whereas ggplot2 objects are static. In an app we have been developing here at Jumping Rivers, we found ourselves asking the question: would it be quicker to use plot_ly() or to wrap a ggplot2 object in ggplotly()? I found the results staggering.
Throughout we will be using the packages dplyr, ggplot2, plotly and microbenchmark. The data in use is the Birthdays dataset in the mosaicData package, which contains the number of births in each US state on each day from 1969 to 1988.
install.packages("mosaicData")
library(mosaicData)
install.packages("dplyr")
library(dplyr)
install.packages("ggplot2")
library(ggplot2)
install.packages("plotly")
library(plotly)
install.packages("microbenchmark")
library(microbenchmark)
Let’s take a look at the data.
b = mosaicData::Birthdays
head(b)
## state year month day date wday births
## 1 AK 1969 1 1 1969-01-01 Wed 14
## 2 AL 1969 1 1 1969-01-01 Wed 174
## 3 AR 1969 1 1 1969-01-01 Wed 78
## 4 AZ 1969 1 1 1969-01-01 Wed 84
## 5 CA 1969 1 1 1969-01-01 Wed 824
## 6 CO 1969 1 1 1969-01-01 Wed 100
Let’s start off with a very simple scatter graph of the mean births in every year.
meanb = b %>% group_by(year) %>% summarise(mean = mean(births))
ggplotly(ggplot(meanb) +
  geom_point(aes(y = mean, x = year, colour = year)))
plot_ly(data = meanb,
  y = ~mean, x = ~year, color = ~year,
  type = "scatter")
Both graphs are identical bar styling, yes?
Now let’s use microbenchmark to see how their timings compare.
time = microbenchmark::microbenchmark(
  ggplotly = ggplotly(ggplot(meanb) +
    geom_point(aes(y = mean, x = year, colour = year))),
  plotly = plot_ly(data = meanb,
    y = ~mean, x = ~year,
    color = ~year, type = "scatter"),
  times = 100, unit = "s")
time
## Unit: seconds
## expr min lq mean median uq
## ggplotly 0.060465833 0.064966002 0.08341875 0.067533562 0.070341152
## plotly 0.005938955 0.006632689 0.00771174 0.007236614 0.008009191
## max neval cld
## 1.60127363 100 b
## 0.03457796 100 a
autoplot(time)
Now, I thought nesting a ggplot object within ggplotly() would be slower than using plot_ly(), but I didn’t think it would be this slow. On average, ggplotly() is 11 times slower than plot_ly() (a mean of 0.083 seconds against 0.0077). 11! One run even took 1.601 seconds!
Let’s take it up a notch. That was just 20 points. What if we plot over 20,000? Here we will plot the min, mean and max births across states on each day.
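The "11 times" figure is simply the ratio of the two mean timings. Here is a minimal sketch of that arithmetic, with the numbers copied from the output above. Note that slowdown() is a hypothetical helper written for this post, not part of microbenchmark. summary(time) should return a data frame with the same column names, so the helper ought to work on the real object too.

```r
# Mean and median timings in seconds, copied from the benchmark output above.
# summary(time) returns a data frame with these column names (among others).
s = data.frame(
  expr = c("ggplotly", "plotly"),
  mean = c(0.08341875, 0.00771174),
  median = c(0.067533562, 0.007236614)
)

# Hypothetical helper: how many times slower `slow` is than `fast`
# for a chosen summary column.
slowdown = function(s, column = "mean", slow = "ggplotly", fast = "plotly") {
  s[[column]][s$expr == slow] / s[[column]][s$expr == fast]
}

round(slowdown(s, "mean"), 1)    # 10.8
round(slowdown(s, "median"), 1)  # 9.3
```

The median ratio is a little lower than the mean ratio because the mean is pulled up by the occasional very slow run.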
meandate = b %>% group_by(date) %>% summarise(births = mean(births))
maxdate = b %>% group_by(date) %>% summarise(births = max(births))
mindate = b %>% group_by(date) %>% summarise(births = min(births))
all = rbind(meandate, maxdate, mindate)
all$stat = rep(c("mean", "max", "min"), each = nrow(meandate))
ggplotly(ggplot(all) + geom_point(aes(y = births, x = date, colour = stat)))
plot_ly(all, x = ~date, y = ~births, color = ~stat, type = "scatter")
Again, both plots are identical bar styling.
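As a quick sanity check on the "over 20,000" claim: each of the three summaries has one row per day, so the point count follows from the date range alone. This is a small sketch that assumes the data covers every day from 1969-01-01 to 1988-12-31, which matches the year range quoted earlier.

```r
# Every calendar day in the assumed range of the data
days = seq(as.Date("1969-01-01"), as.Date("1988-12-31"), by = "day")

n_days = length(days)    # rows in each of meandate, maxdate and mindate
n_points = 3 * n_days    # mean, max and min stacked into one data frame

n_days    # 7305
n_points  # 21915
```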
time2 = microbenchmark(
  ggplotly = ggplotly(ggplot(all) +
    geom_point(aes(y = births, x = date, colour = stat))),
  plotly = plot_ly(all, x = ~date, y = ~births,
    color = ~stat, type = "scatter"),
  times = 100, unit = "s")
time2
## Unit: seconds
## expr min lq mean median uq
## ggplotly 0.342929589 0.362489416 0.382971125 0.370923618 0.380537679
## plotly 0.005959482 0.006483164 0.006973089 0.006705234 0.007153338
## max neval cld
## 0.54437071 100 b
## 0.01080022 100 a
autoplot(time2)
On average, ggplotly() is 55 times slower than plot_ly(), with its slowest run taking 0.544 seconds!